Distance Plot ( B ) Size Plot ( C ) Time

Authors

  • Manoranjan Dash
  • Huan Liu
Abstract

Clustering is an important data exploration task. A prominent clustering algorithm is agglomerative hierarchical clustering. Roughly, in each iteration, it merges the closest pair of clusters. It was first proposed in 1951, and since then there have been numerous modifications. Among its good features are a natural, simple, and non-parametric grouping of similar objects and the ability to find clusters of different shapes, both spherical and arbitrary. But large CPU time and high memory requirements limit its use for large data sets. In this paper we show that geometric metric (centroid, median, and minimum variance) algorithms obey a 90-10 relationship, where roughly the first 90% of iterations are spent on merging clusters with distance less than 10% of the maximum merging distance. This characteristic is exploited by partially overlapping partitioning. Experiments and analyses show that different types of existing algorithms benefit substantially, with drastic reductions in CPU time and memory. Other contributions of this paper include a comparison study of multi-dimensional vis-à-vis single-dimensional partitioning, and analytical and experimental discussions on setting parameters such as the number of partitions and the dimensions used for partitioning.
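The 90-10 relationship described in the abstract can be checked empirically on small data. The following is only a minimal sketch, assuming a naive centroid-linkage implementation with Euclidean distance; the function names `centroid_merge_distances` and `fraction_below` are hypothetical and are not taken from the paper, and the sketch does not implement the partially overlapping partitioning scheme itself.

```python
import math
import random
from itertools import combinations


def centroid_merge_distances(points):
    """Naive centroid-linkage agglomerative clustering.

    Returns the centroid distance of each merge, in merge order.
    Illustrative only: every iteration rescans all cluster pairs.
    """
    # Each cluster is stored as (centroid, number_of_points).
    clusters = [(tuple(float(x) for x in p), 1) for p in points]
    merge_distances = []
    while len(clusters) > 1:
        # Find the closest pair of cluster centroids.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = math.dist(clusters[i][0], clusters[j][0])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = tuple((a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj))
        # Delete the larger index first so the smaller one stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append((centroid, ni + nj))
        merge_distances.append(d)
    return merge_distances


def fraction_below(merge_distances, ratio=0.1):
    """Fraction of merges whose distance is below ratio * (maximum merge distance)."""
    threshold = ratio * max(merge_distances)
    return sum(d < threshold for d in merge_distances) / len(merge_distances)


if __name__ == "__main__":
    # Assumed toy data: three Gaussian blobs in 2-D (not the paper's data sets).
    rng = random.Random(0)
    centers = [(0.0, 0.0), (50.0, 0.0), (0.0, 50.0)]
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in centers for _ in range(40)]
    dists = centroid_merge_distances(data)
    print(f"{fraction_below(dists):.2%} of merges fall below 10% of the maximum merging distance")
```

The maximum is taken over all merges rather than the last one because centroid-linkage merge distances are not guaranteed to increase monotonically.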


Similar resources

Computing alignment plots efficiently

Dot plots are a standard method for local comparison of biological sequences. In a dot plot, a substring to substring distance is computed for all pairs of fixed-size windows in the input strings. Commonly, the Hamming distance is used since it can be computed in linear time. However, the Hamming distance is a rather crude measure of string similarity, and using an alignment-based edit distance...


A Test of the Mean Distance Method for Forest Regeneration Assessment

A new distance-based estimator for forest regeneration assessment, the mean distance method, was developed by combining ideas and techniques from the wandering quarter method, T-square sampling and the random pairs method. The performance of the mean distance method was compared to conventional 4.05 square meter plot sampling through simulation analysis on 405 square meter blocks of a field sur...


Estimation of Optimum Field Plot Size and Shape in Paddy Yield Trial

This paper is to estimate the optimum plot size with the shape for field research experiments on paddy yield trial considering the effect of plot size on variability in yield of crop as well as studying the coefficients of variation of different plot sizes and shapes of plots. The maximum curvature technique and comparable variance methods were exercised to estimate optimum plot size and shape ...


Cycle Plot Revisited: Multivariate Outlier Detection Using a Distance-Based Abstraction

The cycle plot is an established and effective visualization technique for identifying and comprehending patterns in periodic time series, like trends and seasonal cycles. It also allows to visually identify and contextualize extreme values and outliers from a different perspective. Unfortunately, it is limited to univariate data. For multivariate time series, patterns that exist across several...


Evaluation of Simple Methods for Estimating Broad-Sense Heritability in Stands of Randomly Planted Genotypes

Inexpensive estimates of broad-sense heritability (BSH) may be valuable in plant breeding. This research evaluated two methods for estimating BSH with data from stands of equidistantly spaced geno... (...hande, 1957; Nyquist, 1991). This approach is based on the assumption that both random and systematic environmental variation within a planting containing a single genotype follows an inverse logarithm...




Publication date: 2001